Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and Its Variants

The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs always yields optimal integer solutions. The claim that GTED is polynomially solvable is contradictory to the complexity results of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two, correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics.


Introduction
Graph traversal edit distance (GTED) [1] is an elegant measure of the similarity between the strings represented by edge-labeled Eulerian graphs. For example, given two de Bruijn assembly graphs [2], computing GTED between them measures the similarity between two genomes without the computationally intensive and possibly error-prone process of assembling the genomes. Using an approximation of GTED between assembly graphs of Hepatitis B viruses, Ebrahimpour Boroojeny et al. [1] group the viruses into clusters consistent with their taxonomy. This can be extended to inferring phylogeny relationships in metagenomic communities or comparing heterogeneous disease samples such as cancer. There are several other methods to compute a similarity measure between strings encoded by two assembly graphs [3][4][5]. GTED has the advantage that it does not require prior knowledge on the type of the genome graph or the complete sequence of the input genomes. The input to the GTED problem is two unidirectional, edge-labeled Eulerian graphs, which are defined as:

Definition 1 (Unidirectional, edge-labeled Eulerian Graph). A unidirectional, edge-labeled
Eulerian graph is a connected directed graph G = (V, E, , Σ), with node set V , edge multiset E, constant-size alphabet Σ, and single-character edge labels : E → Σ, such that G contains an Eulerian trail that traverses every edge e ∈ E exactly once. The unidirectional condition means that all edges between the same pair of nodes are in the same direction.
Such graphs arise in genome assembly problems (e.g. the de Bruijn subgraphs). Computing GTED is the problem of computing the minimum edit distance between the two most similar strings represented by Eulerian trails each input graph.
Ebrahimpour Boroojeny et al. [1] claim that GTED is polynomially solvable by proposing an integer linear programming (ILP) formulation of GTED and arguing that the constraints of the ILP make it polynomially solvable. This result, however, conflicts with several complexity results on string-to-graph matching problems. Kupferman and Vardi [6] show that it is NP-complete to determine if a string exactly matches an Eulerian tour in an edge-labeled Eulerian graph. Additionally, Jain et al. [7] show that it is NP-complete to compute an edit distance between a string and strings represented by a labeled graph if edit operations are allowed on the graph. On the other hand, polynomial-time algorithms exist to solve string-to-string alignment [8] and string-to-graph alignment [7] when edit operations on graphs are not allowed.
We resolve the conflict among the results on complexity of graph comparisons by revisiting the complexity of and the proposed solutions to GTED. We prove that computing GTED is NP-complete by reducing from the Hamiltonian Path problem, reaching an agreement with other related results on complexity. Further, we point out with a counterexample that the optimal solution of the ILP formulation proposed by Ebrahimpour Boroojeny et al. [1] does not solve GTED.
We give two ILP formulations for GTED. The first ILP has an exponential number of constraints and can be solved by subtour elimination iteratively [9, 10]. The second ILP has a polynomial number of constraints and shares a similar high-level idea of the global ordering approach [10] in solving the Traveling Salesman problem [11].
In Qiu and Kingsford [12], Flow-GTED (FGTED), a variant of GTED is proposed to compare two sets of strings instead of two strings encoded by graphs. FGTED is equal to the edit distance between the most similar sets of strings spelled by the decomposition of flows between a pair of predetermined source and sink nodes. The similarity between the sets of strings reconstructed from the flow decomposition is measured by the Earth Mover's Edit Distance [12,13]. FGTED is used to compare pan-genomes, where both the frequency and content of strings are essential to represent the population of organisms.
Qiu and Kingsford [12] reduce FGTED to GTED, and via the claimed polynomial-time algorithm of GTED, argue that FGTED is also polynomially solvable. We show that this claim is false by proving that FGTED is also NP-complete.
While the optimal solution to ILP proposed in Ebrahimpour Boroojeny et al. [1] does not solve GTED, it does compute a lower bound to GTED. We characterize the cases when GTED is equal to this lower bound. In addition, we point out that solving this ILP formulation finds a minimum-cost matching between closed-trail decompositions in the input graphs, which may be used to compute the similarity between repeats in the genomes. Ebrahimpour Boroojeny et al. [1] claim their proposed ILP formulation is solvable in polynomial time by arguing that the constraint matrix of the linear relaxation of the ILP is always totally unimodular. We show that this claim is false by proving that the constraint matrix is not always totally unimodular and showing that there exists optimal fractional solutions to its linear relaxation.
We evaluate the efficiency of solving ILP formulations for GTED and its lower bound on simulated genomic strings and show that it is impractical to compute GTED on larger genomes.
In summary, we revisit two important problems in genome graph comparisons: Graph Traversal Edit Distance (GTED) and its variant FGTED. We show that both GTED and FGTED are NP-complete, and provide the first correct ILP formulations for GTED. We also show that the ILP formulation proposed by [1] is a lower bound to GTED. We evaluate the efficiency of the ILPs for GTED and its lower bound on genomic sequences. These results provide solid algorithmic foundations for continued algorithmic innovation on the task of comparing genome graphs and point to the direction of approximation heuristics.
2. GTED and FGTED are NP-complete 2.1 Conflicting results on computational complexity of GTED and stringto-graph matching The natural decision versions of all of the computational problems described above and below are clearly in NP. Under the assumption that P = NP, the results on the computational complexity of GTED and string-to-graph matching claimed in Ebrahimpour Boroojeny et al. [1] and Kupferman and Vardi [6], respectively, cannot be both true.
Kupferman and Vardi [6] show that the problem of determining if an input string can be spelled by concatenating edge labels in an Eulerian trail in an input graph is NP-complete.
We call this problem Eulerian Trail Equaling Word. We show in Theorem 1 that we can reduce ETEW to GTED, and therefore if GTED is polynomially solvable, then ETEW is polynomially solvable. The complete proof is in Appendix A.1.
Proof sketch. We first convert an input instance s, G to ETEW into an input instance G 1 , G 2 to GTED by (a) creating graph G 1 that only contains edges that reconstruct string s and (b) modifying G into G 2 by extending the anti-parallel edges so that G 2 is unidirectional. We show that if GTED(G 1 , G 2 ) = 0, there must be an Eulerian trail in G that spells s, and if GTED(G 1 , G 2 ) > 0, G must not contain an Eulerian trail that spells s.
Hence, an (assumed) polynomial-time algorithm for GTED solves ETEW in polynomial time. This contradicts Theorem 6 of Kupferman and Vardi [6] of the NP-completeness of ETEW (under P = NP).

Reduction from Hamiltonian Path to GTED and FGTED
We resolve the contradiction by showing that GTED is NP-complete. The details of the proof are in Appendix A.2.
Proof sketch. We reduce from the Hamiltonian Path problem, which asks whether a directed, simple graph G contains a path that visits every vertex exactly once. Here simple means no self-loops or parallel edges. The reduction is almost identical to that presented in Kupferman and Vardi [6], and from here until noted later in the proof the argument is identical except for the technicalities introduced to force unidirectionality (and another minor change described later).
Let G = (V, E) be an instance of Hamiltonian Path, with n = |V | vertices. We first create the Eulerian closure of G, which is defined as Here, each vertex in V is split into v in and v out , and w is a newly added vertex. E is the union of the following sets of edges and their labels: G is an Eulerian graph by construction but contains anti-parallel edges. We further create G from G by adding dummy nodes so that each pair of antiparallel edges is split into two parallel, length-2 paths with labels x#, where x is the original label.
We also create a graph C that has the same number of edges as G and spells out a We then argue that G has a Hamiltonian path if and only if G spells out the string q, which uses the same line of arguments and graph traversals as in Kupferman and Vardi [6].
We then show that GTED(G , C) = 0 if and only if G spells q.
Following a similar argument, we show that FGTED is also NP-complete, and its proof is in Appendix A.3.

Revisiting the correctness of the proposed ILP solutions to GTED
In this section, we revisit two proposed ILP solutions to GTED by Ebrahimpour Boroojeny et al. [1] and show that the optimal solution to these ILP is not always equal to GTED.

Alignment graph
The previously proposed ILP formulations for GTED are based on the alignment graph constructed from input graphs. The high-level concept of an alignment graph is similar to the dynamic programming matrix for the string-to-string alignment problem [8].
Definition 2 (Alignment graph). Let G 1 , G 2 be two unidirectional, edge-labeled Eulerian graphs. The alignment graph A(G 1 , G 2 ) = (V, E, δ) is a directed graph that has vertex set V = V 1 × V 2 and edge multi-set E that equals the union of the following: Figure 1: (a) An example of two edge labeled Eulerian graphs G 1 (top) and G 2 (bottom). (b) The alignment graph A(G 1 , G 2 ). The cycle with red edges is the path corresponding to GTED(G 1 , G 2 ). Red solid edges are matches with cost 0 and red dashed-line edge is mismatch with cost 1.
Each edge is associated with a cost by the cost function δ : E → R.
Each diagonal edge e = [(u 1 , v 1 ), (u 2 , v 2 )] in an alignment graph can be projected to (u 1 , v 1 ) and (u 2 , v 2 ) in G 1 and G 2 , respectively. Similarly, each vertical edge can be projected to one edge in G 1 , and each horizontal edge can be projected to one edge in G 2 .
We define the edge projection function π i that projects an edge from the alignment graph to an edge in the input graph G i . We also define the path projection function Π i that projects a trail in the alignment graph to a trail in the input graph G i . For example, let a trail in the alignment graph be p = (e 1 , e 2 , . . . , e m ), and Π i (p) = (π i (e 1 ), π i (e 2 ), . . . , π i (e m )) is a trail in G i .
An example of an alignment graph is shown in Figure 1(b). The horizontal edges correspond to gaps in strings represented by G 1 , vertical edges correspond to gaps in strings represented by G 2 , and diagonal edges correspond to the matching between edge labels from the two graphs. In the rest of this paper, we assume that the costs for horizontal and vertical edges are 1, and the costs for the diagonal edges are 1 if the diagonal edge represents a mismatch and 0 if it is a match. The cost function δ can be defined to capture the cost of matching between edge labels or inserting gaps. This definition of alignment graph is also a generalization of the alignment graph used in string-to-graph alignment [7].

The first previously proposed ILP for GTED
where δ(c) is the total edge cost of c, and Π i (c) is the projection from c to G i .
An example of such a minimum-cost trail is shown in Figure 1(b). Ebrahimpour Boroojeny et al. [1] provide the following ILP formulation and claim that it is a direct translation of Lemma 1: subject to Ax = 0 (6) e∈E x e I i (e, f ) = 1 for i = 1, 2 and for all f ∈ E i (7) Here, E is the edge set of A(G 1 , G 2 ). A is the negative incidence matrix of size |V | × |E|, and I i (e, f ) is an indicator function that is 1 if edge e in E projects to edge f in the input graph G i (and 0 otherwise). We define the domain of each x e to include all non-negative integers. However, due to constraints (7), the values of x e are limited to either 0 or 1.
We describe this ILP formulation with the assumption that both input graphs have closed Eulerian trails, which means that each node has equal numbers of incoming and outgoing edges. We discuss the cases when input graphs contain open Eulerian trails in Section 4.
While the ILP in (5)-(8) allows the solutions to select disjoint cycles in the alignment graph, the projection of edges in these disjoint cycles does not correspond to a single string represented by either of the input graphs. We show that the ILP in (5)  the lower bound on the ILP in (5)-(8). This optimal objective value, however, is smaller than GTED(G 1 , G 2 ). Therefore, the ILP in (5)-(8) does not solve GTED since it allows the solution to be a set of disjoint components.

The second previously proposed ILP formulation of GTED
We describe the second proposed ILP formulation of GTED by Ebrahimpour Boroojeny et al. [1]. Following Ebrahimpour Boroojeny et al. [1], we use simplices, a notion from geometry, to generalize the notion of an edge to higher dimensions. A k-simplex is a kdimensional polytope which is the convex hull of its k + 1 vertices. For example, a 1-simplex is an undirected edge, and a 2-simplex is a triangle. We use the orientation of a simplex, which is given by the ordering of the vertex set of a simplex up to an even permutation, to  unique orientations, and we use the signed coefficient to connect their forms together, e.g.
For each pair of graphs G 1 and G 2 and their alignment graph A(G 1 , G 2 ), we define an oriented 2-simplex set T (G 1 , G 2 ) which is the union of: We use the boundary operator [14, p. 28], denoted by ∂, to map an oriented k-simplex to a sum of oriented (k − 1)-simplices with signed coefficients.
wherev i denotes the vertex v i is to be deleted. Intuitively, the boundary operator maps the oriented k-simplex to a sum of oriented (k − 1)-simplices such that their vertices are in the k-simplex and their orientations are consistent with the orientation of the k-simplex. For example, when k = 2, we have: We reiterate the second ILP formulation proposed in Ebrahimpour Boroojeny et al. [1].
[∂] is a |E| × |T (G 1 , G 2 )| boundary matrix where each entry [∂] i,j is the signed coefficient of the oriented 1-simplex (the directed edge) in E corresponding to x i in the boundary of the oriented 2-simplex in T (G 1 , G 2 ) corresponding to y j . The index i, j for each 1-simplex or 2-simplex is assigned based on an arbitrary ordering of the 1-simplices in E or the 2-simplices in T (G 1 , G 2 ). An example of the boundary matrix is shown in Figure 3.
If the Eulerian trail is closed in G i , s i can be any vertex in V i . An example of x init is shown in Figure 2(b).
We provide a complete proof in Section B of the Appendix that the ILP in (5)-(8) is equivalent to the ILP in (11)-(12). Therefore, the example we provided in Section 3.2 is also an optimal solution to the ILP in (11)-(12) but not a solution to GTED. Thus, the ILP in (11)-(12) does not always solve GTED.

New ILP solutions to GTED
To ensure that our new ILP formulations are applicable to input graphs regardless of whether they contain an open or closed Eulerian trail, we add a source node s and a sink node t to the alignment graph. Figure 4 illustrates three possible cases of input graphs.
1. If only one of the input graphs has closed Eulerian trails, wlog, let According to Lemma 1, we can solve GTED(G 1 , G 2 ) by finding a trail in A(G 1 , G 2 ) that satisfies the projection requirements. This is equivalent to finding a s-t trail in A(G 1 , G 2 ) that satisfies constraints: where I i (e, f ) = 1 if the alignment edge e projects to f in G i . An optimal solution to GTED in the alignment graph must start and end with the source and sink node because they are connected to all possible starts and ends of Eulerian trails in the input graphs.
Since a trail in A(G 1 , G 2 ) is a flow network, we use the following flow constraints to enforce the equality between the number of in-and out-edges for each node in the alignment graph except the source and sink nodes. (s,u)∈E Constraints (13) and (16) are equivalent to constraints (7) and (6), respectively. Therefore, we rewrite the ILP in (5) in order to formulate an ILP for the GTED problem, it is necessary to devise constraints that prevent disjoint SCCs from being selected in the alignment graph. In the following, we describe two approaches for achieving this.

Enforcing one trail in the alignment graph via constraint generation
Section 3.2 of Dias et al.
[10] proposes a method to design linear constraints for eliminating disjoint SCCs, which can be directly adapted to our problem. Let C be the collection of all strongly connected subgraphs of the alignment graph A(G 1 , G 2 ). We use the following constraint to enforce that the selected edges form one s-t trail in the alignment graph: where E(C) is the set of edges in the strongly connected subgraph C and ε + (C) is the set of x uv , and (u,v)∈ε + (C) x uv ≥ 1 guarantees that there exists an out-going edge of C that is in the subgraph.
We use the same technique as Dias et al.
[10] to linearize the "if-then" condition in (17) by introducing a new variable β for each strongly connected component: To summarize, given any pair of unidirectional, edge-labeled Eulerian graphs G 1 and G 2 and their alignment graph

A compact ILP for GTED with polynomial number of constraints
In the worst cases, the number of iterations to solve (exponential ILP) via constraint generation is exponential. As an alternative, we introduce a compact ILP with only a polynomial number of constraints. The intuition behind this ILP is that we can impose a partially increasing ordering on all the edges so that the selected edges forms a s-t trail in the alignment graph. This idea is similar to the Miller-Tucker-Zemlin ILP formulation of the Travelling We add variables d uv that are constrained to provide a partial ordering of the edges in the s-t trail and set the variables d uv to zero for edges that are not selected in the s-t trail. Intuitively, there must exist an ordering of edges in a s-t trail such that for each pair of consecutive edges (u, v) and (v, w), the difference in their order variable d uv and d vw is 1. Therefore, for each node v that is not the source or the sink, if we sum up the order variables for the incoming edges and outgoing edges respectively, the difference between the two sums is equal to the number of selected incoming/outgoing edges. Lastly, the order variable for the edge starting at source is 1, and the order variable for the edge ending at sink is the number of selected edges. This gives the ordering constraints as follows: We enforce that all variables x e ∈ {0, 1} and d e ∈ N for all e ∈ E.
The "if-then" statement in Equation (22) Here, y uv is an indicator of whether x uv ≥ 0. The coefficient |E| is the number of edges in the alignment graph and also an upper bound on the ordering variables. When y uv = 1, d uv ≤ 0, and y uv does not impose constraints on x uv . When y uv = 0, x uv ≥ 1, and y uv does not impose constraints on d uv . As we show in Lemma 3 (Appendix D), these constraints prevent finding disjoint components, thus guaranteeing the correctness of the ILP.

A lower bound on GTED
While the (lower bound ILP) and the ILP in (11)-(12) do not solve GTED, the optimal solutions to these ILPs are lower bounds on GTED. These ILPs solve an interesting variant of GTED (Appendix E), which is a local similarity measure between two genome graphs.
We call this variant Closed-trail Cover Traversal Edit Distance (CCTED).
Let the variables in an optimal solution to (lower bound ILP) be x opt and the optimal objective value be c opt . Since the constraints for (lower bound ILP) are a subset of (exponential ILP), and two ILPs have the same objective function, c opt ≤ GTED(G 1 , G 2 ) for any pair of graphs.
Moreover, when the solution to (lower bound ILP) forms only one connected component, the optimal value of (lower bound ILP) is equal to GTED.
has one connected component. A feasible solution to (exponential ILP) is always a feasible solution to (lower bound ILP), and since c opt = GTED(G 1 , G 2 ), an optimal solution to (exponential ILP) is also an optimal solution to (lower bound ILP), which can induce a subgraph in the alignment graph that only contains one connected component.
Conversely, if x opt induces a subgraph in the alignment graph with only one con-nected component, it satisfies constraints (18)-(21) and therefore is feasible to the ILP for GTED (exponential ILP). Since c opt ≤ GTED(G 1 , G 2 ), this solution must also be optimal for GTED(G 1 , G 2 ).
In practice, we may estimate GTED approximately by the solution to (lower bound ILP).
As we show in Section 6, the time needed to solve (lower bound ILP) is much less than the time needed to solve GTED. However, in adversarial cases, c opt could be zero but GTED could be arbitrarily large. We can determine if the c opt is a lower bound on GTED or exactly equal to GTED by checking if the subgraph induced by the solution to (lower bound ILP) has multiple connected components.

Characterizations of the ILP in (11)-(12)
Ebrahimpour Boroojeny et al. [1] propose the ILP in (11)-(12), and we show that the x variables in this ILP have the same feasible region as the x variables in lower bound ILP. However, Ebrahimpour Boroojeny et al. [1] argue that the linear programming relaxation of the ILP in (11)-(12) always yields integer optimal solutions, and therefore the ILP in (11)-(12) can be solved in polynomial time. We provide a counterexample where the ILP in (11)-(12) yields fractional optimal solutions with fractional variable values. Additionally, we show that the constraint matrix of the LP relaxation of the ILP in (11)-(12) is not totally unimodular given most non-trivial input graphs. The details of the proofs and the counterexample are in the Section F of the Appendix.
6. Empirical evaluation of the ILP formulations for GTED and its lower bound 6

.1 Implementation of the ILP formulations
We implement the algorithms and ILP formulations for (exponential ILP), (compact ILP) and (lower bound ILP). In practice, the multi-set of edges of each input graph may contain many duplicates of edges that have the same start and end vertices due to repeats in the strings. We reduce the number of variables and constraints in the implemented ILPs by merging the edges that share the same start and end nodes and record the multiplicity of each edge. Each x variable is no longer binary but a non-negative integer that satisfies the modified projection constraints (13): where where W (C) is the maximum total multiplicities of edges in the strongly connected subgraph in each input graph that is projected from C.
Likewise, constraints (27) that set the upper bounds on the ordering variables also need to be modified as the upper bound of the ordering variable d uv for each edge no longer represents the order of one edge but the sum of orders of copies of (u, v) that are selected, which is at most |E| 2 . Therefore, constraint (27) is changed to The rest of the constraints remain unchanged.
We ran all our experiments on a server with 48 cores (96 threads) of Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz and 378 GB of memory.

GTED on simulated TCR sequences
We construct 20 de Bruijn graphs with k = 4 using 150-character sequences extracted from the V genes from the IMGT database [17]. We solve the linear relaxation of (compact ILP), In summary, it is fastest to compute the lower bound of GTED. Computing GTED exactly by solving the proposed ILPs on genome graphs of size 150 is already time consuming. When the sizes of the genome graphs are fixed, the time to solve for GTED and its lower bound increases as GTED between the two genome graphs increases. In the case where GTED is equal to its lower bound, the subgraph induced by some optimal solutions of (lower bound ILP) contains more than one strongly connected component. Therefore, in order to reconstruct the strings from each input graph that have the smallest edit distance, we generally need to obtain the optimal solution to the ILP for GTED. In all cases, the time to solve the (exponential ILP) is less than the time to solve the (compact ILP).

GTED on difficult cases
Repeats, such as segmental duplications and translocations [18,19] in the genomes increase the complexity of genome comparisons. We simulate such structures with a class of graphs that contain n simple cycles of which n − 1 peripheral cycles are attached to the n-th central cycle at either a node or a set of edges ( Figure 6(a)). The input graphs in Figure 2 belong to this class of graphs that contain 2 cycles. This class of graphs simulates the complex structural variants in disease genomes or the differences between genomes of different species.
We generate pairs of 3-cycle graphs with varying sizes and randomly assign letters from {A,T,C,G} to edges. We compute the lower bound of GTED and GTED using (lower bound ILP) and (compact ILP), respectively. We denote the lower bound of GTED computed by solving (lower bound ILP) as GTED l . We group the generated 3-cycle graph pairs based on the value of (GTED − GTED l ) and select 20 pairs of graphs randomly for each (GTED − GTED l ) value ranging from 1 to 5. The maximum number of edges in all selected graphs is 32.
We show the difficulty of computing GTED using the iterative algorithm on the 100 The distribution of wall-clock time to solve the compact ILP and the iterative exponential ILP on 100 pairs of 3-cycle graphs.
selected pairs of 3-cycle graphs. We terminate the ILP solver after 20 minutes. As shown in Figure 6, as the difference between GTED and GTED l increases, the wall-clock time to solve (exponential ILP) for GTED increases faster than the time to solve (compact ILP) for GTED. For pairs on graphs with (GTED − GTED l ) = 5, on average it takes more than 15 minutes to solve (exponential ILP) with more than 500 iterations. On the other hand, it takes an average of 5 seconds to solve (compact ILP) for GTED and no more than 1 second to solve for the lower bound. The average time to solve each ILP is shown in Table S1.
In summary, on the class of 3-cycle graphs introduced above, the difficulty to solve GTED via the iterative algorithm increases rapidly as the gap between GTED and GTED l increases. Although (exponential ILP) is solved more quickly than (compact ILP) for GTED when the sequences are long and the GTED is equal to GTED l (Section 6.2), (compact ILP) may be more efficient when the graphs contain overlapping cycles such that the gap between GTED and GTED l is larger.

Conclusion
We While the previously proposed ILP of GTED does not solve GTED, it solves for a lower bound of GTED, and we show that this lower bound can be interpreted as a more "local" measure, CCTED, of the distance between labeled graphs. Further, we characterize the LP relaxation of the ILP in (11)-(12) and show that, contrary to the results in Ebrahimpour Boroojeny et al. [1], the LP in (11)-(12) does not always yield optimal integer solutions.
As shown previously [1,12], it takes more than 4 hours to solve (lower bound ILP) for graphs that represent viral genomes that contain ≈ 3000 bases with a multi-threaded LP solver. Likewise, we show that computing GTED using either (exponential ILP) or

A.1 Reduction from ETEW to GTED
We provide below the complete proof for Theorem 1. If GTED(C, G ) > 0, G must not contain an Eulerian trail that spells s. Otherwise, such a trail could be extended to a trail introducing some characters that could be aligned to s with zero cost by aligning gaps with characters.
Hence, an (assumed) polynomial-time algorithm for GTED solves ETEW in polynomial time.

A.2 Reduction from Hamiltonian Path to GTED
We provide below the complete proof for Theorem 2.
Proof. We reduce from the Hamiltonian Path problem, which asks whether a directed, simple graph G contains a path that visits every vertex exactly once. Here simple means no self-loops or parallel edges. Let G = (V, E) be an instance of Hamiltonian Path, with n = |V | vertices. The reduction is almost identical to that presented in Kupferman and Vardi [6], and from here until noted later in the proof the argument is identical except for the technicalities introduced to force unidirectionality (and another minor change described later). The first step is to construct the Eulerian closure of G, which is defined as G = and E is the union of the following sets of edges and their labels: Since G is connected and every outgoing edge in G has a corresponding antiparallel incoming edge, G is Eulerian. It is not unidirectional, so we further create G from G by adding dummy nodes to each pair of antiparallel edges and labelling the length-2 paths so created with x#, where x is the original label of the split edge (a, b, or c) and # is some new symbol (shared between all the new edges). We call these length-2 paths introduced to achieve unidirectionality "split edges".
If such an Eulerian trail exists, then the trail starts with spelling the string a#(b#a#) n−1 , which corresponds to a Hamiltonian trail in G since it visits exactly n "vertex split edges" (type E 1 , labeled a#) and each vertex split edge can be used only once (since it is an Eulerian trail). Further, successively visited vertices must be connected by an edge in G since those are the only b# split edges in G (except those leaving w, but w must not be involved in spelling out a#(b#a#) n−1 , since entering w requires using a split edge labeled c#).
For the other direction, if a G has a Hamiltonian path v 1 , . . . , v n , then walking that sequence of vertices in G will spell out a#(b#a#) n−1 . This path will cover all E 1 edges and the E 2 edges that are on the Hamiltonian path. Retracing the path so far in reverse will use 2n − 1 split edges labeled c#, consuming the (c#) 2n−1 term in q and covering all nodes' reverse vertex edges E 3 (since the path is Hamiltonian). The reverse path also covers the E 4 edges corresponding to reverse Hamiltonian path edges. Our Eulerian trail is now "at" node v in 1 .
What remains is to complete the Eulerian walk covering (a) edges and their antiparallel counterparts corresponding to edges in G that were not used in the Hamiltonian path, and   Proof. Let σ i ∈ T (G 1 , G 2 ) be the 2-simplex corresponding to the entry i of [y i ]. Based on the construction of T (G 1 , G 2 ), σ i has two forms:

A.3 FGTED is NP-complete
Without loss of generality, we assume We can prove this lemma by using the same way when Since We have where is a vector such that all the entries are 0 except that the one corresponding to edge e is 1. we also let [x v ] ∈ R |V | be a vector such that all the entries are 0 except that the one corresponding to vertex v is 1. Therefore, we have Hence, x satisfies the constraint (6) if x satisfies the constraint (6).
In addition, since
We now show that any feasible solution of (5)-(8) is a feasible solution of (11). Let x be a feasible solution of (5)-(8). We show that x is also a feasible solution of (11) by proving that x can be converted to x init in (11)  It is easy to check that x is also a feasible solution of (5)-(8). Therefore, without loss of generality, we assume x to be a vector such that all the entries corresponding to diagonal edges in A(G 1 , G 2 ) are zero.
We then prove that any x can be converted to x init in (11) via the boundary operator.
Let the source and the sink node of x in A(G 1 , G 2 ) be (s 1 1 , s 2 1 ) and (s 1 2 , s 2 2 ), where s i 1 is the source node of G i and s i 2 is the sink node of G i . When the Eulerian trail is closed (meaning that it is an Eulerian tour) in G i , we let s i 1 = s i 2 be an arbitrary vertex in V i . x init can be seen as a trail (tour) in A(G 1 , G 2 ) that starts from (s 1 1 , s 2 1 ), walks along an Eulerian trail of G 2 via all the horizontal edges P h , and then walks along an Eulerian trail of G 1 via all the vertical edges P v , until the sink node (s 1 2 , s 2 2 ). Here is an Eulerian trail of G 1 . We use P 0 = {P h , P v } to denote the trail from (s 1 1 , s 2 1 ) to (s 1 2 , s 2 2 ) that is the concatenation of P h and P v . It is easy to see that each edge in P 0 is unique. For path i, we use a vector x p,i to represent (p i , w p i ), By using the boundary operator, each path p i can actually be converted to a new trail p i such that each edge in p i is also an edge in P 0 . To prove this, we consider the following two cases: • If p i walks along all the horizontal edges followed by all the vertical edges, then every edge in p i is an edge in P 0 . To see that, let e be an horizontal edge in p i , since p i starts Therefore e ∈ P 0 . We can use the same way to prove e ∈ P 0 when e is a vertical edge.
Note that in this case, the number of horizontal edges or vertical edges can be zero. • Now we have a new path, denoted as p 1 i , in which the smallest index of the vertical edges becomes t + 1. Figure 7(a) shows an example, in which the blue line represents the subpath of p i and the red line represents the new subpath in p 1 i .
To create a new vector that represents p 1 i , we first create a zero vector y p,i,1 ∈ R |T (G 1 ,G 2 )| , and from w = 0 to w = k − 1, we iteratively update y p,i,1 via the following equations: The vector x p,i,1 = x p,i + [∂]y p,i,1 is the one that represents p 1 i .
Since the length of p i is finite, by doing such a transformation a finite number of times, we can convert p i to a new path p i such that p i walks along all the horizontal edges first followed by all the vertical edges, therefore each edge in p i is also an edge in P 0 . We use the vectorx p,i to represent p i ,x p,i = x p,i + [∂] q j=1 y p,i,j where q is the number of transformations. Apperantly,x p,i e = 0 when e / ∈ P 0 . Let y p,i = q j=1 y p,i,j , we havex p,i = x p,i + [∂]y p,i .
For cycle i, we also use a vector x c,i to represent (c i , w c i ), Let (v, v ) be an arbitrary chosen node in c i , we construct a trail p i aux that passes (v, v ) as follows: • From (s 1 1 , s 2 1 ), walk along P h until the node (s 1 1 , v ). It corresponds to a part of an Eulerian trail of G 2 .
• From (s 1 2 , v ), walk along the remaining part of the Eulerian trail of G 2 to the node (s 1 2 , s 2 2 ). to itself, and (3) walk along the remaining part of p i aux from (v, v ) to (s 1 2 , s 2 2 ). By using the same way as we described above, each c i + p i aux or p i aux can be converted to a new trail in which each edge is also an edge in P 0 . We usex c,i orx aux,i to represent the new trail accordingly, therefore, we havex c,i = x c,i + x aux,i + [∂]y c,i andx aux,i = x aux,i + [∂]y aux,i .
Likewise,x c,i e =x aux,i e = 0 when e / ∈ P 0 .
We define a new vectorx such that: Therefore,x is a vector converted from x via boundary operations.x is equal to x init because: 1.x e = 0 when e / ∈ P 0 sincex p,i e =x c,i e =x aux,i e = 0 when e / ∈ P 0 for each i.
Combined with the first point, we have thatx e = 1 if e ∈ P 0 andx e = 0 otherwise, Hence, for each feasible solution x of (5)-(8), we have: meaning that x is also a feasible solution of (11).
We proved that the feasibility region of x in (11) is the same as the feasibility region of x in (5)-(8), and since the objective functions of these two linear relaxations are the same, the optimal solutions of them are equal.
By employing the same approach and taking into account that if all edge weights in a flow network are non-negative integers, the flow decomposition theorem guarantees that the network can be decomposed into a finite set of weighted paths and cycles, each with positive integer weight, we can prove that the ILP in (5)-(8) and the ILP in (11)-(12) are also equivalent.
Based on the proof, we can conclude that the way to index the vertices or edges in the alignment graph, or the 2-simplices in T (G 1 , G 2 ), will not affect the equivalence result. Additionally, different choices of orientations for the 2-simplices in T (G 1 , G 2 ) will also not impact the equivalence result. This is because for any two sets T (G 1 , G 2 ) and T (G 1 , G 2 ) containing the same 2-simplices with the same indices but different orientations, if (x, y) is a feasible solution of the ILP in (11)-(12) (or its relaxation) that corresponds to T (G 1 , G 2 ), then (x, y ) is a feasible solution of the ILP in (11)-(12) (or its relaxation) that corresponds to T (G 1 , G 2 ), where y i = y i when σ i ∈ T (G 1 , G 2 ) has the same orientation as σ i ∈ T (G 1 , G 2 ), and y i = −y i when σ i ∈ T (G 1 , G 2 ) has the opposite orientation to σ i ∈ T (G 1 , G 2 ). Therefore, it is acceptable to specify a particular orientation for each 2-simplex when defining T (G 1 , G 2 ).
Appendix C Pseudo-code of the iterative procedure to solve the ILP formulation in Section 4.1 Algorithm 1 Iterative constraint generation algorithm to solve (exponential ILP) 1: Input Two unidirectional, edge-labeled Eulerian graphs and their alignment graph 2: C ← ∅ 3: while true do

4:
Solve the ILP (exponential ILP) with C

5:
if the ILP variables x uv induce a strongly connected component C not satisfying (17) then Define traversal edit distance, edit t (t 1 , t 2 ) as the edit distance between the strings constructed from a pair of trails t 1 and t 2 . In other words, edit t (t 1 , t 2 ) = edit(str(t 1 ), str(t 2 )).
Problem 4 (Closed-Trail Cover Traversal Edit Distance (CCTED)). Given two unidirectional, edge-labeled Eulerian graphs G 1 and G 2 with closed Eulerian trails, compute Here, CC(G) denotes the collection of all possible sets of edge-disjoint, closed trails in G, such that every edge in G belongs to exactly one of these trails. Each element of CC(G) can be interpreted as a cover of G using such trails. M editt (C 1 , C 2 ) is a min-cost matching between two covers using the traversal edit distance as the cost.
CCTED is likely a more suitable metric comparisons between genomes that undergo large-scale rearrangements. This analogy is to the relationship between the synteny block comparison [3] and the string edit distance computation, where the former is more often used in interspecies comparisons and in detecting segmental duplications [22,23] and the latter is more often seen in intraspecies comparisons. We show in the main text that CCTED is always less than or equal to GTED given the same pairs of input graphs.
Following similar ideas as Lemma 1, we can compute CCTED by finding a set of closed trails in the alignment graph such that the total cost of alignment edges are minimized, and the projection of all edges in the collection of selected trails is equal to the multi-set of input graph edges.
Lemma 4. For any two edge-labeled Eulerian graphs G 1 and G 2 , where C is a collection of trails and δ(c) is the total cost of edges in trail c.
Proof. Given any pair of covers C 1 ∈ CC(G 1 ) and C 2 ∈ CC(G 2 ) and their min-cost matching based on the edit distance M editt (C 1 , C 2 ), we can project each pair of matched closed trailed to a closed trail in the alignment graph. For a matching between a trail and the empty item , we can project it to a closed trail in the alignment graph with all vertical edges if the trail is from G 1 or horizontal edges if the trail is from G 2 . The total cost of the projected edges must be greater than or equal to the objective (42). On the other hand, every collection of trails C that satisfy constraint (43) can be projected to a cover in each of the input graphs, and c∈C δ(c) ≥ CCT ED(G 1 , G 2 ). Hence equality holds.

E.1 Correctness of the ILP formulation for CCTED
We show that the ILP in (5) This total cost is greater than or equal to CCTED(G 1 , G 2 ).  12) is not totally unimodular, we first write the LP in standard form.
In a standard form of a LP, all variables are greater than, or equal to 0. Since y vectors in the LP relaxation of the ILP in (11)-(12) can contain negative entries, we decompose it into y + − y − . Given alignment graph A(G 1 , G 2 ) = (V, E, δ) and T (G 1 , G 2 ), we can now write the standard form of the LP in (11) Without loss of generality, let v 0 ∈ V 1 be a node that is incident to at least 3 unique edges. Since G 1 is an Eulerian graph, v must be part of a cycle C in G 1 . Also, there must exist another node v k and an edge between v 0 and v k in either direction, such that the edge between v 0 and v k is not contained in cycle C (Figure 8(a)). Suppose the number of nodes in the cycle is k (k ≥ 3 due to the unidirectionality constraint), and let the cycle Since a specific choice of 1-simplex orientations does not affect the total unimodularity of the boundary matrix, we assume the edge between v 0 and v k is [v k , v 0 ] without loss of generality. We use G sub 1 = (V sub 1 , E sub 1 ) to denote the subgraph with Since |E 2 | ≥ 2 and G 2 is a connected graph, there exist two consecutive, directed edges in G 2 .
We use G sub and is a subgraph of A(G 1 , G 2 ), therefore, each subgraph of A(G sub 1 , G sub 2 ) is also a subgraph of A(G 1 , G 2 ). Similarly, the 2-simplex set T (G sub 1 , G sub 2 ) is a subset of T (G 1 , G 2 ).
The minimal pair of input graphs that satisfy the conditions in Theorem 7 is a graph with one 3-node cycle and one additional edge incident to the cycle and an acyclic, connected graph with three nodes. In practice, most non-trivial edge-labeled Eulerian graphs satisfy these conditions.
According to the definitions in Dey et al.
[24], the subgraph used to construct M 1 in the above proof (Figure 8(c)) is a Möbius subcomplex, and M 1 is a (2k + 3)-Möbius cycle matrix (MCM). Theorem 7 also establishes that there may exist a Möbius subcomplex in an alignment graph, which corrects the false claim made in Lemma 2 in [1].
Theorem 2 in Ebrahimpour Boroojeny et al. [1] attempts to employ a more algebraic approach to attempt to demonstrate that [∂] is TU by establishing that the alignment graph is a Möbius-free product space. However, the property of being Möbius-free globally does not imply the absence of Möbius subcomplexes locally. As we show in Theorem 7, although the alignment graph A(G sub 1 , G sub 2 ) is homotopically equivalent to the one-dimensional circle, which is Möbius-free, it still contains a Möbius subcomplex.

F.2 The LP yields optimal fractional solutions
The fact that [∂] is not totally unimodular does not guarantee that the LP in (11)-(12) has a fractional optimal objective value. In this section, we prove that the LP in (11)-(12) does not always yield integer optimal solutions by constructing a specific example with a fractional optimal objective value.
We prove the above theorem by giving an example where the LP in (5)-(8) yields a fractional optimal solution. Since by Theorem 5, two LPs are equivalent, it follows that the LP in (11)-(12) also yields the same fractional optimal solution. Construct G 1 and G 2 such that their edges and edge labels are equal to the ones specified in Figure 9(a). Let the edge multi-set of A(G 1 , G 2 ) be E. We assign an edge cost to 0 if the edge matches two equal characters and 1 otherwise. Construct vector x * ∈ R |E| and set entries corresponding to edges in Figure 9(b) to 0.5 except edge [(v 3 , v c ), (v 0 , v f )] to which the corresponding entry is set to 1. Set the rest of the entries of x * to 0.
Proof. We prove the optimality of x * via complementary slackness. We first write the LP in (5)-(8) in standard form.
Here, δ is a vector of size |E| where each entry is cost of edge e. The constraint matrix A of the primal LP (45) has |E| columns and |V | + |E 1 | + |E 2 | = m rows, where V is the vertex set of A(G 1 , G 2 ), and E 1 and E 2 are edge multi-sets of the input graphs. The first |V | rows correspond to the constraints specified in (6). The rest of the rows correspond to the constraints in (7) that enforce the projected multi-set of edges to be equal to the multi-set of edges in each input graph. Since the input graphs both contain Eulerian tours, the vector b has size m, where the first |V | entries are zeroes and the rest of the entries are 1s.
We write the dual form of LP (45) as follows.
Let the objective value of LP (45) given a x as input is obj p x , and the objective value of LP (46) given a y as input is obj d y . To show that x * is an optimal solution to the LP in (5)-(8), we need to show that there exists a feasible solution to the dual LP, y * , that satisfies the complementary slackness conditions and that obj d y * = obj p x * .
Since each alignment edge has two endpoints and is projected to at most one edge in each graph, there are at most 4 non-zero entries in each column of A. The variables in y of the dual form can be interpreted in three parts. Each of the first |V | entries of y can be assigned to each vertex in the alignment graph, and the next |E 1 | entries can be assigned to edges in G 1 and the last |E 2 | entries can be assigned to edges in G 2 . There are |E| constraints in the dual LP, and the e-th constraint can be assigned to one edge in the alignment graph has cost δ e . Therefore, each constraint that is assigned to a horizontal or a vertical edge can be written as where i = 1 if e is a horizontal edge, and i = 2 if e is a vertical edge. y v in e and y v out e are the y entries that are assigned to the vertices that are the start and end of edge e, and y e i are the y entries that assigned to the π i (e).
Similarly, each constraint that is assigned to a diagonal edge is y v in e − y v out e + y e 1 + y e 2 ≤ δ e .
We can verify that x * is a feasible solution of the primal form (45) by checking if constraints (6)-(7) are satisfied. The primal objective value can be computed in a straightforward way, and we can obtain obj p x * = 3.5.
According to complementary slackness conditions, since x * e > 0 for edges shown in Figure 9(b), the corresponding constraints in the dual LP (46) must be tight, meaning that the equality must hold in these constraints. The rest of the dual constraints could have slacks.
Let the subgraph of A(G 1 , G 2 ) shown in Figure 9 The projected cycle from C to G 2 is Sum up all the constraints that are assigned edge e where x * e > 0. Since these edges form a cycle, we get: The summed edge cost is 7 as there are 7 edges that are either mismatch edges or vertical edges.
All y entries that correspond to vertices are free variables and are in every constraint.
After fixing the y variables that satisfy constraint (54), the rest of the y variables can be set to satisfy the dual cosntraint. We now obtain y * which is a feasible solution to the dual LP.
The only entries in y * that could have non-zero dual costs are those that correspond to edges in E 1 and E 2 . Since these corresponding dual costs are all 1, obj d y * = e 1 ∈C 1 y e 1 + e 2 ∈C 2 y e 2 = 3.5 = obj p x * .
Since the costs of alignment graph edges are all integers, the fact that the LP in (11)-(12) and the LP in (5)-(8) yield fractional optimal objective values mean that they must yield fractional solutions and assign fractional values to entries in x. Theorem 8 follows. Since the LP in (11)-(12) yields fractional solutions and GTED is always an integer, solving the LP in (11)-(12) does not solve GTED.